Reconstructing (super)trees from data sets with missing distances: not all is lost.
نویسندگان
چکیده
The wealth of phylogenetic information accumulated over many decades of biological research, coupled with recent technological advances in molecular sequence generation, presents significant opportunities for researchers to investigate relationships across and within the kingdoms of life. However, to make best use of this data wealth, several problems must first be overcome. One key problem is finding effective strategies to deal with missing data. Here, we introduce Lasso, a novel heuristic approach for reconstructing rooted phylogenetic trees from distance matrices with missing values, for data sets where a molecular clock may be assumed. Contrary to other phylogenetic methods on partial data sets, Lasso possesses desirable properties such as its reconstructed trees being both unique and edge-weighted. These properties are achieved by Lasso restricting its leaf set to a large subset of all possible taxa, which in many practical situations is the entire taxa set. Furthermore, the Lasso approach is distance-based, rendering it very fast to run and suitable for data sets of all sizes, including large data sets such as those generated by modern Next Generation Sequencing technologies. To better understand the performance of Lasso, we assessed it by means of artificial and real biological data sets, showing its effectiveness in the presence of missing data. Furthermore, by formulating the supermatrix problem as a particular case of the missing data problem, we assessed Lasso's ability to reconstruct supertrees. We demonstrate that, although not specifically designed for such a purpose, Lasso performs better than or comparably with five leading supertree algorithms on a challenging biological data set. Finally, we make freely available a software implementation of Lasso so that researchers may, for the first time, perform both rooted tree and supertree reconstruction with branch lengths on their own partial data sets.
منابع مشابه
Influence of Pattern of Missing Data on Performance of Imputation Methods: An Example from National Data on Drug Injection in Prisons
Background Policy makers need models to be able to detect groups at high risk of HIV infection. Incomplete records and dirty data are frequently seen in national data sets. Presence of missing data challenges the practice of model development. Several studies suggested that performance of imputation methods is acceptable when missing rate is moderate. One of the issues which was of less concern...
متن کاملEstimating Phylogenies from Lacunose Distance Matrices: Additive is Superior to Ultrametric Estimation
Lapointe and Kirsch (1995) have recently explored the possibility of reconstructing phylogenetic trees from lacunose distance matrices. They have shown that missing cells can be estimated using the ultrametric property of distances, and that reliable trees can be derived from such filled matrices. Here, we extend their work by introducing a novel way to estimate distances based on the four-poin...
متن کاملTitle: A weighted least-squares approach for inferring phylogenies from incomplete distance matrices Authors:
Motivation: The problem of phylogenetic inference from data sets including incomplete or uncertain entries is among the most relevant issues in systematic biology. In this paper, we propose a new method for reconstructing phylogenetic trees from partial distance matrices. The new method combines the usage of the four-point condition and the ultrametric inequality with a weighted least-squares a...
متن کاملTitle : A weighted least - squares approach for inferring phylogenies from incomplete distance matrices
Motivation: The problem of phylogenetic inference from data sets including incomplete or uncertain entries is among the most relevant issues in systematic biology. In this paper, we propose a new method for reconstructing phylogenetic trees from partial distance matrices. The new method combines the usage of the four-point condition and the ultrametric inequality with a weighted least-squares a...
متن کاملA Global Maximum Likelihood Super-Quartet Phylogeny Method
Extending the idea of our previous algorithm [17, 18] we developed a new sequential quartet-based phylogenetic tree construction method. This new algorithm reconstructs the phylogenetic tree iteratively by examining at each merge step every possible super-quartet which is formed by four subtrees instead of simple quartet in our previous algorithm. Because our new algorithm evaluates super-quart...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Molecular biology and evolution
دوره 32 6 شماره
صفحات -
تاریخ انتشار 2015